Spectral methods cluster words of the same class in a syntactic dependency network 
We analyze here a particular kind of linguistic network where vertices represent words and edges stand for syntactic relationships between words. The statistical properties of these networks have recently been studied, and various features such as the small-world phenomenon and a scale-free distribution of degrees have been found. Our work focuses on four classes of words: verbs, nouns, adverbs and adjectives. Here, we use spectral methods to sort vertices. We show that the ordering clusters words of the same class. For nouns and verbs, the cluster size distribution clearly follows a power-law distribution that cannot be explained by a null hypothesis. Long-range correlations are found between vertices in the ordering provided by the spectral method. The findings support the use of spectral methods for detecting community structure.

PACS numbers: 89.75.-k, 89.20.-a 
I. INTRODUCTION

A great deal of effort has recently been devoted to the study of complex networks. The topology of systems as different as the Internet, the World Wide Web, biological systems and social systems has been found to share similar statistical properties. Many networks display both a distribution of distances (defined as the minimum number of edges between two vertices) peaked around a small characteristic value (i.e. the small-world effect) and a degree (defined as the number of links per vertex) that is distributed according to a power-law function. Driven by the widespread presence of these common features, much work has been done to understand the basic mechanisms underlying such universality.

Instead, we focus on a further characterization of networks using spectral methods that have been used for detecting communities in complex networks. Community detection techniques now represent a flourishing, interdisciplinary field. The aim of the present paper is to show how spectral methods cluster four classes of content words (verbs, nouns, adverbs and adjectives) in a Syntactic Dependency Network (SDN) [25], which is a kind of linguistic network. The topology of various instances of linguistic networks has recently been considered by several studies. Examples include thesaurus networks based on Roget's thesaurus and networks based on the Merriam-Webster thesaurus, WordNet [28], word association networks [26], word co-occurrence networks, and SDNs [25]. The latter, as said above, are the subject of the present article.

SDNs are constructed by collecting the syntactic dependency links from a corpus, i.e. a set of sentences. Syntactic links are defined according to the dependency grammar formalism [30], a special case of a broad family of grammatical formalisms [31, 32]. Dependency grammar assumes that the syntactic structure of a sentence consists of vertices (words) and pairwise connections between words (syntactic dependencies). In this approximation, a dependency link connects a pair of words and is represented as an edge of the graph. In the simple sentence "John has apples", "John" is linked to "has" and "apples" is linked to "has". As explained in Fig. 1, we can assign a direction to these edges. "John" modifies the meaning of the word "has", so "John" plays the role of modifier and the word "has" is referred to as its head. Most edges are thus directed, and we assume that arrows go from the modifier to its head (other conventions may make the opposite choice). In some cases, such as coordination, there is no clear direction. Since such cases are rather uncommon, we assume that every link has a definite direction and assign an arbitrary direction to the undirected cases.

FIG. 1: A) The syntactic structure of a simple sentence. Words are the vertices and the syntactic dependencies are the edges of the graph. The proper noun 'John' and the verb 'has' are syntactically dependent in this sample sentence. 'John' is a modifier of the verb 'has', which is its head. Similarly, the action of 'has' is modified by its object 'apples'. Here we assume the graph is oriented, with edges pointing from a modifier to its head. B) Mapping the syntactic dependency structure of the sentence into a global syntactic dependency network.

We define an SDN as a set of $n$ words labeled with natural numbers from 1 to $n$. Syntactic dependencies are specified by the adjacency matrix $A = \{a_{ij}\}$, where $a_{ij} = 1$ if the $i$-th word points to the $j$-th word and $a_{ij} = 0$ otherwise. More precisely, $a_{ij} = 1$ if the $i$-th word has pointed to the $j$-th word at least once in a sentence from the corpus. The syntactic dependency structure of a sentence can be seen as a subset of the vertices and links contained in a global network (Fig. 1B). Syntactic dependencies between words in a sentence tend to be local: the distribution of the distance between syntactically linked words in a sentence decays exponentially [35].

The rest of the paper is organized as follows. Section II gives a brief account of community detection techniques, with special emphasis on spectral methods and the way they are applied here. Section III describes the source of the data and the statistical measures used to show how the spectral methods cluster words of the same class. Section IV presents the results. A discussion of the findings and some conclusions are reported in Section V.
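As an illustration of how the adjacency matrix of an SDN could be assembled from parsed sentences, the following Python sketch accumulates (modifier, head) pairs into a binary matrix. The function name `build_sdn`, the input format (a list of sentences, each a list of modifier-head pairs) and the 0-based labels are assumptions made for this example; the text labels words from 1 to n and does not prescribe an implementation.

```python
def build_sdn(sentences):
    """Build a binary adjacency matrix from dependency-parsed sentences.

    `sentences` is an iterable of sentences; each sentence is a list of
    (modifier, head) word pairs (a hypothetical input format).
    Returns (index, A): index maps each word to a 0-based label and
    A[i][j] == 1 if word i pointed to word j at least once.
    """
    index = {}
    edges = set()
    for dependencies in sentences:
        for modifier, head in dependencies:
            i = index.setdefault(modifier, len(index))
            j = index.setdefault(head, len(index))
            edges.add((i, j))  # repeated links collapse into one edge
    n = len(index)
    A = [[0] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = 1
    return index, A


# "John has apples": both "John" and "apples" point to their head "has".
corpus = [
    [("John", "has"), ("apples", "has")],
    [("John", "eats"), ("apples", "eats")],
    [("John", "has")],  # seen again: the matrix entry stays 1
]
index, A = build_sdn(corpus)
```

Because edges are stored in a set, a dependency observed in several sentences still yields a single entry equal to one, which matches the "at least once" criterion.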



II. THE COMMUNITY DETECTION TECHNIQUE

Various techniques for detecting community structure have been proposed recently. Some of them use the same arguments underlying the algorithm introduced by Newman and Girvan [20] (NG-algorithm), and they focus on the value of the edge betweenness. This quantity measures the fraction of all shortest paths passing over a given link or, according to an alternative definition, the probability that a random walk on the network runs through that link. By removing edges with large betweenness, one splits the whole network, step by step, into connected components. The process goes on until the whole graph is decomposed into communities consisting of a single node each. The method is very efficient whenever some clues about the community structure are at hand. Otherwise, it does not give an indication of the resolution of the clustering and therefore needs extra information as input (such as the expected number of clusters). Furthermore, its outcome is independent of how sharp the partitioning of the graph is.

To overcome such problems we adopted a different approach based on spectral analysis [14]. That approach combines the power of spectral analysis with the caution needed to reveal an underlying structure when there is no clear-cut partitioning, as is the case for the SDN considered.

Spectral methods are based on the analysis of simple functions of the adjacency matrix $A$ [15, 16, 17]. In particular, the most widely investigated functions of $A$ are the Laplacian matrix $L = K - A$, the Normal matrix $N = K^{-1}A$ and the composition $AA^T$, which is particularly useful when dealing with oriented graphs. In the quantities defined above, $K$ is the diagonal matrix with elements $k_{ii} = \sum_{j=1}^{n} a_{ij}$ and $n$ is the number of nodes in the network. Most approaches concern undirected networks, where $A$ is assumed to be symmetric; this is not the case here, which explains our emphasis on the matrix $AA^T$.

To familiarize the reader with the connection between spectral analysis and communities, let us consider the simplest case among those described above, the matrix $N$. The elements in a row can be interpreted as the probabilities that a walker moves from a given vertex to each of the other ones, since those elements sum to one. From this probabilistic interpretation, it follows that the largest eigenvalue of $N$ is always equal to one. This eigenvalue is associated with a trivial constant eigenvector, due to row normalization. In a network with an apparent cluster structure, $N$ also has a certain number $m - 1$ of eigenvalues close to one, where $m$ is the number of well-defined communities, the remaining eigenvalues lying a gap away from one. We denote by $x_i$ the position of the $i$-th vertex after sorting vertices by the value of their corresponding components in one eigenvector. The eigenvectors associated with these $m - 1$ largest eigenvalues also have a characteristic structure: the components corresponding to nodes within the same cluster are assigned very similar $x_i$ values so that, as long as the partition is sufficiently sharp, the profile of each eigenvector, sorted by components, is step-like. The number of steps in the profile again gives the number $m$ of communities [14]. Similar information is encoded in the non-negative definite Laplacian matrix, where the eigenvalues close to zero are associated with clusters. While one can easily show that these spectral properties are associated with the clustering pattern in the Normal matrix case, this is less evident for the other cases. Nevertheless, it has been shown [16] that the spectral structure of $AA^T$ helps in detecting sets of highly mutually connected nodes in directed networks, thereby indicating the presence of communities in networks such as the World Wide Web. We will rely on the same empirical evidence here for the analysis of our word dataset.
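The link between eigenvalues of the Normal matrix and communities can be checked on a toy example. The sketch below is only an illustration under stated assumptions: it uses NumPy, an undirected graph made of two cliques joined by a single edge (not a network from this study), and the hypothetical helper names `normal_matrix_spectrum` and `two_cliques`.

```python
import numpy as np


def normal_matrix_spectrum(A):
    """Eigenvalues of N = K^{-1} A, sorted from largest to smallest.

    A is a symmetric 0/1 adjacency matrix without isolated vertices,
    so every row sum (degree) is positive.
    """
    A = np.asarray(A, dtype=float)
    degrees = A.sum(axis=1)
    N = A / degrees[:, None]                 # each row now sums to one
    eigenvalues = np.linalg.eigvals(N).real  # real when A is symmetric
    return np.sort(eigenvalues)[::-1]


def two_cliques(size):
    """Toy graph (an assumption of this example): two cliques, one bridge."""
    n = 2 * size
    A = np.zeros((n, n))
    A[:size, :size] = 1
    A[size:, size:] = 1
    np.fill_diagonal(A, 0)
    A[size - 1, size] = A[size, size - 1] = 1  # the single bridging edge
    return A


spectrum = normal_matrix_spectrum(two_cliques(5))
```

With two well-defined communities (m = 2), the spectrum has the trivial eigenvalue one, a single further eigenvalue close to one (about 0.93 for this graph), and all remaining eigenvalues a clear gap below.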

As explained in [14], solving the eigenvalue problem is equivalent to minimizing a suitable function of the $x_i$'s under a suitable constraint. The absolute minimum corresponds to the trivial eigenvector, which is constant. The remaining stationary points correspond to eigenvectors whose components associated with well-connected nodes assume similar values.

For the present aim, it is enough to mention that (as for general real networks) the typical eigenvector profiles in our SDN are not step-like, but rather resemble a continuous curve. Nevertheless, the method still applies. In fact, we will see that components corresponding to nodes belonging to the same class (or equivalently, that do not belong to a certain class) are still strongly correlated and take, in each eigenvector, similar values among themselves. Thus, a natural way to identify communities in an automatic manner is to measure the correlation

$$r_{ij} = \frac{\langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle}{\left[ (\langle x_i^2 \rangle - \langle x_i \rangle^2)(\langle x_j^2 \rangle - \langle x_j \rangle^2) \right]^{1/2}}, \tag{1}$$

where the average $\langle \cdot \rangle$ is computed over the first few non-trivial eigenvectors. The quantity $r_{ij}$ measures the community closeness between the vertices $i$ and $j$. Although the performance may be improved by averaging over a larger number of eigenvectors, at the cost of increased computational effort, we found that a small number of eigenvectors suffices to measure the correlation between two vertices, which is positively correlated with the chance that the two vertices belong to the same community.

The spectral method is used in a different manner here. The class each node belongs to is known in advance; we then check whether the spectral method clusters words of the same class once they are sorted by their component in the eigenvector. Since we know that the spectral method detects community structure, clusters are also likely to correspond to communities. With this aim, we define $S = \{1, \ldots, i, \ldots, n\}$ as a sequence of vertices, where $n$ is the number of vertices of the syntactic dependency network. We assume that $S$ is the outcome of taking one eigenvector and sorting the vertices by the value of the corresponding components in the eigenvector. We define $C = \{c_1, \ldots, c_i, \ldots, c_n\}$, a binary vector indicating, for every vertex in $S$, whether or not it can be of the class under consideration. More precisely, $c_i = 1$ if the $i$-th vertex of $S$ can be of the class under consideration and $c_i = 0$ otherwise. The correlations we are interested in are not computed between values in $X = \{x_1, \ldots, x_i, \ldots, x_n\}$ as in [14] but between values in $C$.
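The construction of the binary vector from one eigenvector can be sketched as follows. This is a minimal illustration, assuming NumPy and the matrix $AA^T$; the function name `class_sequence`, the `rank` parameter and the toy graph are hypothetical, and which eigenvectors count as non-trivial is left to the caller.

```python
import numpy as np


def class_sequence(A, in_class, rank=0):
    """Binary vector C for one eigenvector of A A^T.

    A        : n x n adjacency matrix (A[i][j] = 1 if i points to j).
    in_class : length-n sequence of 0/1 flags, 1 if the vertex can be
               of the class under consideration.
    rank     : which eigenvector to use, counted from the largest
               eigenvalue (0 = largest).
    """
    A = np.asarray(A, dtype=float)
    M = A @ A.T                                    # symmetric by construction
    eigenvalues, eigenvectors = np.linalg.eigh(M)  # ascending eigenvalues
    column = eigenvectors[:, ::-1][:, rank]        # reorder to descending
    S = np.argsort(column, kind="stable")          # vertex ordering S
    return [int(in_class[v]) for v in S]           # membership flags along S


# Toy directed graph: vertices 0-2 point to head 3, vertices 4-5 to head 6.
A = np.zeros((7, 7))
for modifier, head in [(0, 3), (1, 3), (2, 3), (4, 6), (5, 6)]:
    A[modifier, head] = 1
in_class = [1, 1, 1, 0, 1, 1, 0]   # e.g. 1 = "can be a noun" (illustrative)
C = class_sequence(A, in_class)
```

Sorting only permutes the vertices, so the sequence always has the same composition as the class flags; the sign ambiguity of an eigenvector can reverse the ordering but does not change the lengths of the runs of ones.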



III. METHODS 

For the spectral analysis, we consider two different matrices, $A$ and $AA^T$. We use $AA^T$ because it is a way of identifying equivalent vertices. Here equivalence means playing the same role in the network, i.e. pointing approximately to the same set of nodes. With this matrix, we follow the same approach as in [16]. $A$ is used for two reasons: simplicity, and as a null hypothesis for $AA^T$. The methods in [15, 17] do not use $A$ directly but close variations such as the Laplacian or the Normal matrix. The main differences between the present application of the spectral technique and the one performed in [14] are the following. First, we use simple adjacency matrices instead of matrices with edge weights. Second, we do not normalize the matrices. Normalizing the matrix is problematic here because most verbs have no outgoing links (recall that the convention adopted here is that arcs go from modifiers to heads). Such vertices would be special in the graph and would affect the statistics of the whole system, which is why we did not normalize our adjacency matrix. Moreover, nodes without outgoing links give rise to trivial stationary solutions for a random walk on the network: a walker moving on the graph would systematically get stuck in these "sink" vertices. Another solution, pursued in [14], is to remove vertices with no in-going or out-going links (pruning). Here, however, pruning would mean a drastic reduction in the number of verbs, which may preclude the analysis of that class.

We now introduce various measures of the degree of clustering of words of a given class in $C$. We consider a special case of $C$, $C_{scrambled}$, which has the same composition as $C$ but with scrambled components. The measures obtained for $C_{scrambled}$ are used as a null hypothesis for those obtained for $C$. Significant clustering cannot be claimed unless the clustering obtained with $C$ and $C_{scrambled}$ differ clearly. Scrambled sequences have been used to test the significance of long-range correlations in DNA sequences. We used the first 98 non-trivial eigenvectors, so that we have 98 $C$ vectors for each class.

We define $k$ as a random variable measuring the length of a cluster of 1s in $C$. A cluster is a maximal sequence of consecutive elements equal to one in $C$. If $C = \{0,0,0,1,1,1,0,0,0,0,1,0,1,1,1\}$, we have two clusters of length 3 ('111') and one cluster of length 1 ('1'). We define $\langle k \rangle$ as the mean value of $k$ in $C$.
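The definition of a cluster translates directly into a run-length computation. The following Python sketch is a minimal illustration; the function names and the fixed shuffling seed are assumptions of this example, not part of the original analysis.

```python
import random


def cluster_lengths(C):
    """Lengths of the maximal runs of consecutive ones in C, in order."""
    lengths = []
    run = 0
    for c in C:
        if c == 1:
            run += 1
        elif run > 0:
            lengths.append(run)
            run = 0
    if run > 0:            # a cluster that reaches the end of C
        lengths.append(run)
    return lengths


def mean_cluster_length(C):
    """Mean cluster length, or 0.0 if C contains no ones."""
    lengths = cluster_lengths(C)
    return sum(lengths) / len(lengths) if lengths else 0.0


def scrambled(C, seed=0):
    """Copy of C with the same composition but shuffled components."""
    copy = list(C)
    random.Random(seed).shuffle(copy)
    return copy


C = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1]
print(cluster_lengths(C))        # [3, 1, 3]
print(mean_cluster_length(C))    # 7/3, about 2.33
```

Applying `cluster_lengths` to `scrambled(C)` gives the corresponding statistics for one realization of the scrambled ordering.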

We define $P(k)$ as the probability that a cluster has size $k$. We estimate $P(k)$ from the proportion of clusters of length $k$ in $C$. $k$ is expected to follow a geometric distribution in $C_{scrambled}$. There, for $k \ge 1$, we have

$$P(k) = \langle c_i \rangle^{k-1} (1 - \langle c_i \rangle), \tag{2}$$

where $\langle c_i \rangle$ is the mean value of the components of $C$. In other words, $\langle c_i \rangle$ is the proportion of ones in $C$. From equation (2) it follows that

$$\langle k \rangle = \frac{1}{1 - \langle c_i \rangle} \tag{3}$$

and the standard deviation is

$$\sigma(k) = \frac{\langle c_i \rangle^{1/2}}{1 - \langle c_i \rangle}. \tag{4}$$
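The closed forms for the mean and standard deviation of the cluster length under the null hypothesis can be checked numerically. The sketch below assumes a geometric distribution over cluster lengths starting at 1, with the proportion of ones as its parameter, and sums the series directly; the function name `geometric_null` and the truncation point `k_max` are assumptions of this example.

```python
import math


def geometric_null(p, k_max=10_000):
    """Mean and standard deviation of k under P(k) = p**(k-1) * (1-p), k >= 1.

    p is the proportion of ones in C. The sums are truncated at k_max,
    which is an assumption of this sketch (the tail is negligible for p
    well below 1).
    """
    mean = 0.0
    second_moment = 0.0
    weight = 1.0 - p               # P(1)
    for k in range(1, k_max + 1):
        mean += k * weight
        second_moment += k * k * weight
        weight *= p                # P(k+1) = p * P(k)
    std = math.sqrt(second_moment - mean ** 2)
    return mean, std


p = 0.55
mean, std = geometric_null(p)
print(mean, 1 / (1 - p))                  # both close to 2.22
print(std, math.sqrt(p) / (1 - p))        # both close to 1.65
```

The summed series agrees with $1/(1 - \langle c_i \rangle)$ and $\langle c_i \rangle^{1/2}/(1 - \langle c_i \rangle)$, so the null expectation can be obtained from the class proportion alone, without generating scrambled sequences.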



Since the estimated $P(k)$ may contain a considerable amount of noise, we use $P_{>}(k)$, the cumulative $P(k)$, defined as

$$P_{>}(k) = \sum_{K > k} P(K). \tag{5}$$
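A cumulative distribution of this kind can be estimated from a list of observed cluster lengths as sketched below. The function name `cumulative_distribution` is hypothetical, and the strict inequality (clusters longer than k) is taken as printed in the definition, which is an assumption about the intended convention.

```python
from collections import Counter


def cumulative_distribution(lengths):
    """Return {k: fraction of clusters with length greater than k}.

    `lengths` is a list of cluster lengths; k runs from 0 to the
    maximum observed length. An empty input gives an empty result.
    """
    if not lengths:
        return {}
    total = len(lengths)
    counts = Counter(lengths)
    result = {}
    remaining = total
    for k in range(0, max(lengths) + 1):
        remaining -= counts.get(k, 0)     # drop clusters of length exactly k
        result[k] = remaining / total
    return result


cumulative = cumulative_distribution([3, 1, 3])
```

Summing over the tail smooths the estimate because each value pools all clusters above a given length rather than relying on the count at a single length.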



We are also interested in measures of the correlations between $c_i$ and $c_{i+d}$, with $d > 0$. If community detection places words of the same community consecutively in $C$, correlations above the value expected for the scrambled sequence should appear. Here we use two measures of correlation, $\Gamma(d)$ and $I(d)$, which are, respectively, the Pearson correlation coefficient and the information transfer between vertices at distance $d$ in $C$ [37]. $\Gamma(d)$ and $I(d)$ are complementary measures. $\Gamma(d)$ measures positive and negative correlations, whereas $I(d)$ cannot distinguish positive from negative correlations. $I(d)$ captures non-linear correlations that $\Gamma(d)$ does not detect [38]. $I(d)$ is apparently more sensitive to finite sampling than $\Gamma(d)$. Here, $\Gamma(d)$ is defined as



$$\Gamma(d) = \frac{COV(c_i, c_{i+d})}{\sigma(c_i)\,\sigma(c_{i+d})}, \tag{6}$$

where $1 \le d < n$, $COV(c_i, c_{i+d})$ is the covariance between $c_i$ and $c_{i+d}$, $\sigma(c_{i+d})$ is the standard deviation of $c_{i+d}$, and $\sigma(c_i)$ is $\sigma(c_{i+d})$ with $d = 0$. We have

$$\langle c_{i+d} \rangle = \frac{1}{n-d} \sum_{i=1}^{n-d} c_{i+d}. \tag{7}$$

The covariance is defined as

$$COV(c_i, c_{i+d}) = \frac{1}{n-d} \sum_{i=1}^{n-d} (c_i - \langle c_i \rangle)(c_{i+d} - \langle c_{i+d} \rangle). \tag{8}$$

The standard deviation is defined as

$$\sigma(c_{i+d}) = \left( \langle c_{i+d}^2 \rangle - \langle c_{i+d} \rangle^2 \right)^{1/2} \tag{9}$$

as usual. Given that $\langle c_i^2 \rangle = \langle c_i \rangle$ because $C$ is a binary sequence, it follows that

$$\sigma(c_{i+d}) = \left( \langle c_{i+d} \rangle (1 - \langle c_{i+d} \rangle) \right)^{1/2}. \tag{10}$$

Combining equations (6), (8) and (10), we obtain

$$\Gamma(d) = \frac{\sum_{i=1}^{n-d} (c_i - \langle c_i \rangle)(c_{i+d} - \langle c_{i+d} \rangle)}{(n-d) \left( \langle c_i \rangle (1 - \langle c_i \rangle) \langle c_{i+d} \rangle (1 - \langle c_{i+d} \rangle) \right)^{1/2}}. \tag{11}$$

Here, $I(d)$ is defined as

$$I(d) = \sum_{x, y \in \{0,1\}} p(c_i = x, c_{i+d} = y) \log \frac{p(c_i = x, c_{i+d} = y)}{p(c_i = x)\, p(c_{i+d} = y)}, \tag{12}$$

where

$$p(c_i = x, c_{i+d} = y) = \frac{1}{n-d} \sum_{\substack{i = 1 \\ c_i = x \text{ and } c_{i+d} = y}}^{n-d} 1, \tag{13}$$

$$p(c_i = x) = \sum_{y \in \{0,1\}} p(c_i = x, c_{i+d} = y) \tag{14}$$

and

$$p(c_{i+d} = y) = \sum_{x \in \{0,1\}} p(c_i = x, c_{i+d} = y). \tag{15}$$

Now we study the Romanian syntactic dependency network described in [25], with $n = 5563$ vertices. We chose that network from the set of three networks studied in [25] because it is the most complete; the other networks systematically lack several syntactic dependencies. The network is a small-world one with significantly high clustering, has a power-law distribution of vertex degrees and exhibits disassortative mixing. Those properties are shared with other non-linguistic biological networks [25]. As in [25], we worked on the largest connected component. A given word belongs to a given class if that word is labeled under that class at least once in the corpus that originated the syntactic dependency network in [25]. Table I shows the number of words in each class according to this definition. Only 6% of words cannot fall into any of the previous classes. Participles and infinitives were excluded from the class verb, as was done in the original corpus. Those words represent a very small fraction of the vertices.

Type         Number   Proportion
Verbs           985   0.17
Nouns          3093   0.55
Adverbs         171   0.03
Adjectives     1129   0.20
Other           337   0.06

TABLE I: Counts of the frequency of every word class. The total number of different words (i.e. vertices in the syntactic dependency network) is 5563. A word counts for a certain class if it has appeared at least once for that class in the Romanian corpus studied in [25]. Since a word can be of different classes, the sum of the column 'Number' does not necessarily equal the number of different words.

Class        Cluster size   A       AA^T             Scrambled
Verbs        ⟨k⟩            2.01    2.08 ± 0.32      1.21 ± 0.51
             max            43      24.96 ± 20.87    4.72 ± 0.84
Nouns        ⟨k⟩            2.66    2.68 ± 0.13      2.25 ± 1.67
             max            32      34.35 ± 10       13.73 ± 2.04
Adverbs      ⟨k⟩            1.06    1.08             1.03 ± 0.18
             max            2       3                2.16 ± 0.39
Adjectives   ⟨k⟩            1.39    1.53             1.25 ± 0.56
             max            10      8.14 ± 0.83      5.11 ± 0.80

TABLE II: Cluster sizes for the four classes of words, the two types of matrices considered ($A$ and $AA^T$) and scrambled vertex orderings. Standard errors smaller than 10 [exponent illegible] are not shown.
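The two correlation measures of this section, the Pearson coefficient and the information transfer at distance d, can be sketched for a binary sequence as follows. This is an illustration under stated assumptions: the function names are hypothetical, the mean of the unshifted sequence is taken over all n elements (the d = 0 case) while the shifted mean runs over the last n - d elements, and natural logarithms are used because the base is not stated.

```python
import math


def pearson_at_distance(C, d):
    """Pearson correlation between c_i and c_{i+d} in a binary sequence C.

    The unshifted mean is taken over the whole sequence (the d = 0 case)
    and the shifted mean over the last n - d elements.
    """
    n = len(C)
    if not 1 <= d < n:
        raise ValueError("d must satisfy 1 <= d < n")
    mean_i = sum(C) / n
    mean_shift = sum(C[d:]) / (n - d)
    cov = sum((C[i] - mean_i) * (C[i + d] - mean_shift)
              for i in range(n - d)) / (n - d)
    var_product = mean_i * (1 - mean_i) * mean_shift * (1 - mean_shift)
    if var_product == 0:
        raise ValueError("correlation undefined for a constant sequence")
    return cov / math.sqrt(var_product)


def information_at_distance(C, d):
    """Information transfer between c_i and c_{i+d} in a binary sequence C.

    Natural logarithms are an assumption of this sketch.
    """
    n = len(C)
    if not 1 <= d < n:
        raise ValueError("d must satisfy 1 <= d < n")
    joint = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
    for i in range(n - d):
        joint[(C[i], C[i + d])] += 1 / (n - d)
    p_first = {x: joint[(x, 0)] + joint[(x, 1)] for x in (0, 1)}
    p_second = {y: joint[(0, y)] + joint[(1, y)] for y in (0, 1)}
    total = 0.0
    for (x, y), p in joint.items():
        if p > 0:      # terms with zero probability contribute nothing
            total += p * math.log(p / (p_first[x] * p_second[y]))
    return total


alternating = [0, 1] * 50
gamma_1 = pearson_at_distance(alternating, 1)      # strongly negative
info_1 = information_at_distance(alternating, 1)   # close to log 2
```

The alternating sequence shows why the two measures are complementary: the Pearson coefficient is strongly negative at distance 1 and positive at distance 2, whereas the information transfer is large in both cases and carries no sign.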



IV. RESULTS 

Fig. 2 shows some examples of $C$ for nouns, adverbs and adjectives. Large clusters can be visually identified. Table II gives the mean $\langle k \rangle$ and the maximum value of the cluster size over the sample set of eigenvectors.

Fig. 3 shows the mean $P(k)$ over the sample set of eigenvectors for the two kinds of matrices and the null hypothesis. For verbs and nouns, $P(k)$ obeys

$$P(k) \sim k^{-\gamma} \tag{16}$$

for sufficiently large $k$. The agreement with equation (16) is lower for $A$ than for $AA^T$. The distributions for verbs and nouns cannot be accounted for by the geometric distribution predicted by the null hypothesis, which supports the significance of the results. No conclusion can be drawn for adverbs, since they are not sufficiently represented, and it seems unlikely that adjectives follow equation (16).
FIG. 2: Visual representation of some binary vectors $C = \{c_i\}$ delivered by the spectral methods for different word classes: verbs, nouns, adverbs and adjectives. $c_i = 1$ if the $i$-th vertex (or word) belongs to the class under consideration and $c_i = 0$ otherwise. $C$ has 5563 components, which is the number of vertices of the SDN. $C$ is the outcome of ordering vertices by the corresponding values of the eigenvectors delivered by the spectral methods and replacing each vertex by 1 if it is of the class under consideration or by 0 otherwise. White and black indicate, respectively, $c_i = 1$ and $c_i = 0$. Two types of $C$ are shown for each word class: (A) an ordering using $A$ and (B) an ordering using $AA^T$.



Figs. 4 and 5 show $\Gamma(d)$ for $A$ and $AA^T$, respectively. Figs. 6 and 7 show $I(d)$ for $A$ and $AA^T$, respectively. In order to determine the length of the correlations for a given correlation measure in a conservative way, we define two series, $B_{real}(d)$ and $B_{null}(d)$. $B_{real}(d)$ is the mean value of the real correlation at distance $d$ over the sample of eigenvectors minus the corresponding standard deviation; it is an approximate lower bound for the real value of the correlation. $B_{null}(d)$ is the mean value of the null hypothesis series over the sample of eigenvectors plus the corresponding standard deviation; it is an approximate upper bound for the null hypothesis. Since real correlations tend to decrease with distance (Figs. 4, 5, 6 and 7), an approximate conservative measure of the length of statistically significant correlations is $d^*$, the smallest value of $d$ at which $B_{real}(d)$ and $B_{null}(d)$ cross. Long-distance correlations were found (Table III), except for adverbs, due to the small number of vertices that can be adverbs (recall Table I).
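The conservative estimate of the correlation length described above can be sketched as a comparison of two bounds at each distance. The function name `significant_length`, the use of the population standard deviation and the toy numbers are assumptions of this example; the upper bound for the null hypothesis is taken as mean plus standard deviation.

```python
from statistics import mean, pstdev


def significant_length(real_series, null_series):
    """Smallest d at which the lower bound of the real correlation
    meets the upper bound of the null hypothesis.

    real_series, null_series: one list per eigenvector, each giving the
    correlation at d = 1, 2, ... (same length everywhere).
    Returns d* (1-based) or None if the two bounds never cross.
    """
    n_distances = len(real_series[0])
    for d in range(n_distances):
        real_at_d = [series[d] for series in real_series]
        null_at_d = [series[d] for series in null_series]
        lower_real = mean(real_at_d) - pstdev(real_at_d)
        upper_null = mean(null_at_d) + pstdev(null_at_d)
        if lower_real <= upper_null:
            return d + 1
    return None


# Two eigenvectors, three distances (made-up numbers for illustration).
real = [[0.9, 0.5, 0.1], [0.7, 0.3, 0.1]]
null = [[0.0, 0.0, 0.0], [0.2, 0.2, 0.2]]
d_star = significant_length(real, null)
```

Using a lower bound for the real series and an upper bound for the null series makes the estimate conservative: the real correlation must exceed the null expectation by more than the combined spread to count as significant.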



V. DISCUSSION 

In the previous sections we have obtained three main results. First, the spectral methods significantly cluster words of the same class. Second, those spectral methods sort vertices in such a way that long-range correlations appear in the binary vector of membership to a certain class. Third, $AA^T$ clusters word classes better than $A$: the chances of getting a large cluster are higher using $AA^T$ (Fig. 3). The results presented here confirm the power of the spectral methods introduced in [14] for detecting community structure in linguistic networks. Using a priori information about the class membership of every vertex, we have discovered that the spectral method clusters words consistently with the word types that linguists have been distinguishing for a long time.

Class        Correlation   d* (A)   d* (AA^T)
Verbs        Γ(d)          1164     915
             I(d)          872      683
Nouns        Γ(d)          50       63
             I(d)          37       13
Adverbs      Γ(d)          6        6
             I(d)          1        1
Adjectives   Γ(d)          148      881
             I(d)          36       648

TABLE III: $d^*$, the approximate length of statistically significant correlations using two different correlation measures: the Pearson correlation coefficient $\Gamma(d)$ at distance $d$ and the information transfer $I(d)$ at distance $d$. Four classes of words are considered.

FIG. 3: Cumulative $P(k)$, where $P(k)$ is the proportion of clusters of length $k$. Results using $A$ (circles), $AA^T$ (squares) and the null hypothesis (dashed line) are shown. A clear power-law trend is found for the series of $AA^T$ for verbs and nouns from each arrow to the right.

Discovering word classes while minimizing the amount of a priori linguistic knowledge is a challenge in linguistics. Recently, Montemurro & Zanette have clustered words using only information about the degree of heterogeneity with which words are distributed throughout a text. In particular, they have found that nouns and adjectives tend to be more heterogeneously distributed than verbs and adverbs. In a syntactic network, we have found that verbs and nouns are significantly more heterogeneously distributed, according to the ordering provided by spectral methods, than adverbs and adjectives. The present work suggests that word classes could eventually be discovered using only the structure of syntactic interactions.
FIG. 5: $\Gamma(d)$, the correlation coefficient between words of a particular class as a function of $d$, the distance in a vertex ordering provided by the spectral methods. Two series are shown for each word class: $\Gamma(d)$ using $AA^T$ (black) and $\Gamma(d)$ for the scrambled ordering (gray).

FIG. 6: $I(d)$, the information transfer between words of a particular class as a function of $d$, the distance in a vertex ordering provided by the spectral methods. Two series are shown for each word class: $I(d)$ using $A$ (black) and $I(d)$ for the scrambled ordering (gray).

FIG. 7: $I(d)$, the information transfer between words of a particular class as a function of $d$, the distance in a vertex ordering provided by the spectral methods. Two series are shown for each word class: $I(d)$ using $AA^T$ (black) and $I(d)$ for the scrambled ordering (gray).



